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VLASOV ON GPU (VOG PROJECT)* ** * 



M. Mehrenberger^ C. Steiner^, L. Marradi^, N. Crouseilles^, E. 

SONNENDRUCKER^ AND B. AfEYAN^ 

Abstract. This work concerns the numerical simulation of the Vlasov-Poisson equation using semi- 
Lagrangian methods on Graphical Processing Units (GPU). To accomplish this goal, modifications to 
traditional methods had to be implemented. First and foremost, a reformulation of semi-Lagrangian 
methods is performed, which enables us to rewrite the governing equations as a circulant matrix 
operating on the vector of unknowns. This product calculation can be performed efficiently using FFT 
routines. Second, to overcome the limitation of single precision inherent in GPU, a 6f type method 
is adopted which only needs refinement in specialized areas of phase space but not throughout. Thus, 
a GPU Vlasov-Poisson solver can indeed perform high precision simulations (since it uses very high 
order of reconstruction and a large number of grid points in phase space). We show results for more 
academic test cases and also for physically relevant phenomena such as the bump on tail instability 
and the simulation of Kinetic Electrostatic Electron Nonlinear (KEEN) waves. 

Resume. Ce travail concerne la simulation numerique du modele de Vlasov-Poisson a I'aide de 
methodes semi-Lagrangiennes, sur des architectures GPU. Pour cela, quelques modifications de la 
methode traditionnelle ont du etre effectuees. Tout d'abord, une reformulation des methodes semi- 
Lagrangiennes est proposee, qui permet de la reecrire sous la forme d'un produit d'une matrice circu- 
lante avec le vecteur des inconnues. Ce calcul pent etre fait efficacement grace aux routines de FFT. 
Puis, pour contourner le probleme de la simple precision, une methode de type Sf est utilisee. Ainsi, 
un code Vlasov-Poisson GPU permet de simuler et de decrire avec un haut degre de precision (grace 
a I'utilisation de reconstructions d'ordre eleve et d'un grand nombre de points de I'espace des phases) 
des cas tests academiques mais aussi des phenomenes physiques pertinents, comme la simulation des 
ondes KEEN. 
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Introduction 

At the one body distribution function level, the kinetic theory of charged particles interacting with electro- 
static fields and ignoring collisions, may be described by the Vlasov-Poisson system of equations. This model 
takes into account the phase space evolution of a distribution function fit^x^v) where t > denotes time, x 
denotes space and v is the velocity. Considering one-dimensional systems leads to the ID x ID Vlasov-Poisson 
model where the solution f{t,x,v) depends on time t > 0, space x G [0, L] and velocity v G M. The distribution 
function / satisfies 

dtf^vdJ^EdJ = 0, (1) 

where E{t,x) is an electric field. Poisson's law dictates that the charge particle distribution must be summed 
over velocity to render the self-consistent electric field as a solution to the Poisson equation: 

d,E = [ fdv- 1. (2) 

To ensure the uniqueness of the solution, we impose to the electric field a zero mean condition E{t^ x)dx = 0. 
The Vlasov-Poisson system (l)-(2) requires an initial condition f{t = O^x^v) = fo{x,v). We will restrict our 
attention to periodic boundary conditions in space and vanishing / at large velocity. 

Due to the nonlinearity of the self-consistent evolution of two interacting fields, in general it is difficult 
to find analytical solution to (l)-(2). This necessitates the implementation of numerical methods to solve it. 
Historically, progress was made using particles methods (see [4]) which consist in advancing in time macro- 
particles through the equations of motion whereas the electric field is computed on a spatial mesh. Despite the 
inherent statistical numerical noise and their low convergence, the computational cost of particle methods is very 
low even in higher dimensions which explains their enduring popularity. On the other hand, Eulerian methods, 
which have been developed more recently, rely on the direct gridding of phase space Eulerian methods 

include finite differences, finite volumes or finite elements. Obviously, these methods are very demanding in 
terms of memory, but can converge very fast using high order discrete operators. Among these, semi-Lagrangian 
methods try to retain the best features of the two approaches: the phase space distribution function is updated 
by solving backward the equations of motion {i.e. the characteristics), and by using an interpolation step 
to remap the solution onto the phase space grid. These methods are often implemented in a split-operator 
framework. Typically, to solve (l)-(2), the strategy is to decompose the multi-dimensional problem into a 
sequence of ID problems. We refer to [2,6,9, 12, 14, 15, 19,24] for previous works on the subject. 

The main goal of this work is to use recent GPU devices for semi-Lagrangian simulations of the Vlasov-Poisson 
system (l)-(2). Indeed, looking for new algorithms that are highly scalable in the field of plasmas simulations 
(like tokamak plasmas or particle beams), it is important to mimic plasma devices more reliably. Particle 
methods have already been tested on such architectures, and good scalability has been obtained as in [5,30]. 
We mention a recent precursor work on the parallelization in GPU in the context of a gyrokinetic eulerian 
code GENE [13]. Semi-Lagrangian algorithms dedicated to the simplified setting of the one-dimensionnal 
Vlasov-Poisson system have also recently been implemented in the CUD A framework (see [22,25]). In the 
latter two works, in which the interpolation step is based on cubic splines, one can see that the efficiency 
can reach a factor of x80 in certain cases. Here, we use higher complexity algorithms, which are based on 
the Fast Fourier Transform (EFT). We will see that our GPU simulations will directly benefit from the huge 
acceleration obtained for the EFT on GPU. They are thus also very fast enabling us to test and compare 
different interpolation operators (very high order Lagrangian or spline reconstructions) using a large number of 
grid points per direction in phase space. 

To achieve this task, ffexibility is required to switch easily from one representation of an operator to another. 
Here, semi-Lagrangian methods are reformulated in a framework which enables the use of existing optimized 
Fast Fourier Transform routines. This formulation gives rise to a matrix which possesses the circulant property, 
which is a consequence of the periodic boundary conditions. Let us emphasize that such boundary conditions 
are used not only in x but also in v; this is made possible by taking the velocity domain [— ^max, ^max], with v^ax 
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big enough. Note also that the proof of convergence of such numerical schemes can be obtained following [3,7]. 
Due to the fact that such matrices are diagonalizable in a Fourier basis, the matrix vector product can be 
performed efficiently using FFT. In this work, Lagrange polynomials of various odd degrees {2d + 1) and B- 
spline of various degree k have been tested and compared. Another advantage of the matrix-vector product 
formulation is that the numerical cost is almost insensitive to the order of the method. Finally, since single 
precision computations are preferable to get maximum performance out of a GPU, other improvements have to 
be made to the standard semi-Lagrangian method. To achieve the accuracy needed to observe relevant physical 
phenomena, two modifications are proposed: the first is to use a df type method following [22]. The second is 
to impose a zero spatial mean condition on the electric field. Since the response of the plasma is periodic, this 
is always satisfied. 

The rest of the paper is organized as follows. First, the reformulation of the semi-Lagrangian method using 
FFT is presented for the numerical treatment of the doubly periodic Vlasov-Poisson model. Then, details 
of the GPU implementation are given, highlighting the particular modifications that were necessary in order 
to overcome the single precision limitation of GPUs. We then move on to show numerical results. These 
involve several comparisons between the different methods and orders of numerical approximation and their 
performances on GPU and CPU on three canonical test problems. 

1. FFT IMPLEMENTATION 

In this section, we give an explicit formulation of semi-Lagrangian schemes for the solution of the Vlasov- 
Poisson system of equations in the doubly periodic case using circulant matrices. First, the classical directional 
Strang splitting (see [9,27]) is recalled. Then, the problem is reduced to a sequence of one-dimensional con- 
stant advections; Irrespective of the method or order of the interpolation in a specific class, a circulant-matrix 
formulation is proposed, for which the use of Fast Fourier Transform is very well suited. 

1.1. Strang-splitting 

For the Vlasov-Poisson set of equations (l)-(2), it is natural to split the transport in the x-direction from the 
transport in the ^-direction. Moreover, this also corresponds to a splitting of the kinetic and electrostatic 
potential part of the Hamiltonian |'^P/2 + (/>(^,^) where the electrostatic potential <p is related to the electric 
field through E{t^x) = —dx4>{t-,x). 

For plasmas simulations, even when high order splittings is possible (see [11] and references therein), the sec- 
ond order Strang splitting is a good compromise between accuracy and simplicity, which explains its popularity. 
It is composed of three steps plus an update of the electric field before the advection in the ^-direction 

(1) Transport in v over At/2: compute f^{x^v) = g{At/2^x^v) by solving 

dtg{t, X, v) + E'^{x)dyg{t, x, v) = 0, 

with the initial condition g{0^x^v) = f^{x,v). 

(2) Transport in x over At: compute f'^'^{x^v) = g{At,x^v) by solving 

dtg{t, X, v) + vdxg{t, x, v) = 0, 
with the initial condition g{0^x^v) = f'^{x^v). 

Update of electric field E^^^{x) by solving dxE^^'^{x) = J f^*{x^v)dv — 1. 

(3) Transport in v over At/2: compute f^~^^{x^v) = g{At/2^x^v) by solving 

dtg{t, X, v) + E''^\x)dyg{t, x, v) = 0, 
with the initial condition g{0,x,v) = f'^'^{x^v). 
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One of the main advantages of this sphtting is that the algorithm reduces to solving a series of one-dimensional 
constant coefficient advections. Indeed, considering the transport along the x-direction, for each fixed f , one 
faces a constant advection. The same is true for the 'u-direction since for each fixed does not depend on 

the advected variable v. We choose to start with the advection in f , which permits to get the electric field at 
integer multiples of time steps. The third step of the n^^ iteration could be merged with step (1) of the (n + 1)*^ 
iteration, but we do not resort to this short cut here. 

1.2. Constant advection 

In this part, a reformulation of semi-Lagrangian methods is proposed, in the case of constant advection 
equations with periodic boundary conditions. Let us consider u = u{t, x) to be the solution of the following 
equation for a given c G M: 

dtu + cdxU = 0, u{t = 0, x) = uo{x), 
where periodic boundary conditions are assumed in x e [0,L]. The continuous solution satisfies for all t^s>0 
and all x G [0, L]: u{t, x) = u{s, x — c{t — s)). Let us mention that x — c{t — s) has to be understood modulo L 
since periodic boundary conditions are being considered. 

Let us consider a uniform mesh within the interval [0, L]: Xi = iAx for i = 0, . . . , and Ax = L/N. We 
also introduce the time step At = f^^^ — for n G N. Note that we have = vJ^. By setting 

/ ^0 \ 



u = 



uitji^ Xi) , 



(3) 



the semi-Lagrangian scheme reads u^^^ = 7ru^{xi — cAt) where tt is a piecewise polynomial function which 
interpolates for i = 0, . . . , TV — 1: 7i{xi) = . This can be reformulated into u^^^ = Au^ where A is the 
matrix defining the interpolation. Periodic boundaries imply that the matrix A is circulant: 



A = C(ao,ai,...,aiv-i) := 



/ ao ai ... 



CtN-l \ 
CiN-2 



(4) 



Obviously, this matrix depends on the choice of the polynomial reconstruction tt. In the following, some explicit 
examples are shown. 

Examples of various methods and orders of interpolation 

We have to evaluate 7ru'^{xi — cAt). Let P := —cAt/ Ax be the normalized displacement which can be written 
in a unique way diS /3 = b-\-b^ with (6, 6*) G Z x [0, 1[. This means that the feet of the characteristics {xi — cAt) 
belong to the interval [x^* , [ with + 6* = z + /3, or = z + 6. 
(1) Lagrange 1. The nonvanishing terms of the matrix A are: 



l-h\ 



ab = i - , a^^^ 

(2) Lagrange 2d-\-l (with 2c^ + 1 < TV — 1). The nonvanishing terms of matrix are : 



Vj G {-d,...,d-\-l}, ab^j 



n 



k 



k=—d, k^j 



k 
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(3) B- Spline of degree k. 

We define Bf{x) the B-spline of degree k on the mesh {xi)i by the following recurrence: 

B^x) = BHx) = "-j^BtHx) + (l - i^^^) 

Then, in this case, the nonvanishing terms of the matrix A are: 




N-k 



where the nonvanishing terms of the circulant matrix M are: 

Vj G {0, . . . , k}, rrib-j = B^{xj+b^). 

Now, starting from this reformulation, the algorithm reduces to a matrix vector product at each time step. 
Since the matrices are circulant, this product can be performed using FFT. Indeed, circulant matrices are 
diagonalizable in Fourier space [18] so that 

A = UDU'', 

where U is unitary {U"^ denotes the adjoint matrix of U) and D is diagonal. They are given by 

Um,k = m,/c = O...A/'-l, 

N-l 

Dm,m = ^a^e-^^™^/^, m = 0,...,7V-l. 

The product of [/ by a vector v G can then be obtained performing the Fast Fourier Transform of In the 
same way, U'^v can be obtained by computing the inverse Fourier Transform of v. 

The product matrix vector Au'^ = UDU^u^ is then computed following the algorithm: 

(1) Compute Wu"" by calculating u = FFT"^(^x^). 

(2) Compute D by calculating FFT(a). 

(3) Compute w = DU^u^ by calculating Du. 

(4) Compute Au'^ by calculating FFT(i(;). 

The complexity of the algorithm is then 0{N\ogN)^ independently of the degree of the polynomial reconstruc- 
tion. 

2. CUDA GPU IMPLEMENTATION 

We use kernels on GPU by using existing NVIDIA routines for FFT, transposition and scalar product. Note 
that such a choice has also been made in the more difficult context [13]. We would have liked to use OPENCL 
(as done in [10]) in order not be attached to NVIDIA cards; but we had difficulties achieving the friendly 
well-documented features of NVIDIA, especially for the FFT. 

FFTs are computed using the cufft library. For transposition, different possible algorithms are provided. The 
condition N = = Ny is always required for this step. In order to compute charge density p = J f(t^ x, v)dv^ 
we adapt ScalarProd routine. 

We also write a kernel on GPU for computing coefficients of the A matrix. An analytical formula is used 
for each coefficient a^. In the case of Lagrange interpolation of degree 2(i + 1, the complexity switches from 
0{Nd) to 0{Nd^) operations because of a rewritten CPU divided differences based algorithm which cannot be 
parallelized. 

The main steps of the algorithm are : 
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• Initialisation: the initial condition computed on CPU and transferred to GPU 

• Computation of initial charge density p on GPU by using ScalarProd 

• Transfer of p to CPU 

• Computation of the electric field E on CPU 

• Time loop 

1. At/2 advection in v with FFT on GPU 

2. Transposition in order to pass into the x-direction on GPU 

3. At advection in the x direction with FFT on GPU 

4. Transposition in order to pass into the ^-direction on GPU 

5. Computation of p on GPU by using ScalarProd 

6. Transfer of p to CPU 

7. Computation of the electric field E on CPU 

8. At/2 advection in v with FFT on GPU 

3. Questions about single precision 

In principle, computations on GPU can be performed using either single or double precision. However, the 
numerical cost becomes quite high when one deals with double precision (we will see in our case, that the cost is 
generally a factor of two) and is not always easily available across all platforms. Note that in [13] and [25], only 
double precision was used. Discussions about single precision have already been presented in [22]. Hereafter, 
we propose two slight modifications of the semi-Lagrangian method which enable the use of single precision 
computations while at the same time recovering the precision reached by a double precision CPU code. 

3.1. 5f method 

The 5f method consists on a scale separation between an equilibrum and a perturbation so that we decompose 
the solution as 

/(X, V) = 8f{x, V) + /eq(v), /eq(v) = —?= eX.-^{-v'^ / 2) . 

y In 

Then, we are interested in the time evolution of 5f which satisfies 

dtSf + vdjf + Ed,[U^ + 5f] = 0. 

The Strang splitting presented in subsection 1.1 is modified since we advect 5f instead of /. Since /eq only 
depends on v, advections in x are not modified. Now we can rewrite the w-advection as 

9*[/eq + <^/]+^*5„[/eq + <5/] =0, 

with the initial condition /eq + 5f'^. This means that (/eq + 5f) is preserved along the characteristics (/eq + 
/*^)(x, = (/eq + f-'){x,v- AtE'^ix)). We then deduce that 

(5/**(x, V) = Sr{x, V - AtE'^ix)) + /eq(^ - AtE"^ (x)) - /eq(^). 

which provides the update of 5f for the I'-advection. Note that /eq('^ — AtE'^ix)) is an evaluation and not an 
interpolation. 

3.2. The zero mean condition 

The electric field is computed from (2). Note that the right hand side of (2) has zero mean, and the resulting 
electric field has also zero mean. This is true at the continuous level; however when we deal with single precision, 
a systematic cumulative error could occur here. In order to prevent this phenomenon, we can enforce the zero 
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mean condition on the discrete grid numerically: from ~ p(^"> Xk) = /{t", Xk, v)dv, k = 0, . . . , N — 1, we 
compute the mean 

N-l 

/c=0 

and then subtract this value to p^: 

p^ = p^-M, ^ = 0,...,7V-1, 

so that pit^^Xk) — 1 is of zero mean numericahy. Without this modification, we would have Pk — Pk ~ 1- 
We repeat this same procedure once the electric field is computed: from a given computed electric field k = 
0, . . . , — 1, which may not be of zero mean, we compute M = Ylk=Q 

El = El-M, k = Q,...,N -I. 

For computing the electric field, we use the trapezoidal rule: 

= ^fc + Ax ^ A: = 0, . . . , AT - 1, with set arbitrarily to zero, 

or Fourier (with FFT). Note that in the case of Fourier, the zero mean is automatically satisfied numerically, 
since the mode which represents the mean is set to 0. 

We will see that this zero mean condition is of great importance in the numerical results. It has to be satisfied 
with enough precision. It can be viewed as being related to the "cancellation problem" observed in PIC 
simulations [20]. Note also, that by dealing with (5/, which is generally of small magnitude, a better resolution 
of the zero mean condition is reached. 



4. Numerical results 

This section is devoted to the presentation of numerical results obtained by the following methods: the 
standard semi-Lagrangian method (with various different interpolation operators), including the 5f and zero 
mean condition modifications. Comparisons between CPU and GPU simulations and discussions about the 
performance will be given on three test cases: Landau damping, bump on tail instability, and KEEN waves. 
As interpolation operator, we will use use by default LAG17, the Lagrange 2d ^ 1 interpolation with d = 8. 
Similarly, LAGS stands for d = 1 and LAG9 for d = A. We will also show simulations with standard cubic 
splines (for comparison purposes), which correspond to B-splines of degree k with k = A. We will use several 
machines for GPU: MacBook, irma-gpul and hpc. See subsection 4.4 for details. 

4.1. Landau Damping 

For this first standard test case [21], the initial condition is taken to be: 

1 f v'^\ 

fo{x,v) = -j=ex.p ( ~y ) (1 +Q^COS(0.5X)), {x,v) G [0,47r] X [-^max, ^max], 

with a = 10~^. We are interested in the time evolution of the electric energy Se{t) = {l/2)\\E{t)\\j^2 which is 
known to be exponentially decreasing at a rate 7 = 0.1533 (see [29]). Due to the fact that the electric energy 
decreases in time, this test emphasizes the difference between single and double precision computations. 

Numerical results are shown on Figure 1 (top and middle left). We use LAG17, N = 2048, '^max = 8 as 
default values. 

In the single precision case (top left), we see the benefit of using the zero mean modifications (plots 6 and 7: 
efft nodelta and trap zero mean nodelta): the two results are similar (we use the trapezoidal rule for the electric 
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field or Fourier and we recall that in both cases, the zero mean is satisfied) and improved with respect to the 
case where the zero mean is not enforced in the trapezoidal case (plot 8: trap nodelta). 23 right oscillations 
are reached until time t = 50 for plots 6 and 7 (the two last oscillations are however less accurately described), 
whereas we have only 16 right oscillations until time t = 34.8 for plot 8, before saturation. If we use the Sf 
method, we observe a further improvement (plots 1 to 5): we gain 4 oscillations (that is we have 27 oscillations 
in total) until time t = 60, and the electric field is below 6 • 10~^ < e~^^. Note that in the case where we use 
the 5f method, adding the zero mean modification has no impact here; on the other hand, results with the 6f 
method are better than results with the zero mean modification on this picture. We have also added a result 
on an older machine (plot 9: MacBook), which leads to very poor results (only 9 oscillations until time t = 19 
for the worst method). Also the results, which are not shown here, due to space limitations, were different by 
applying the modifications; as an example, we got 28 right oscillations by using the Sf method with zero mean 
modification. Floating point standard may not have been satisfied there which could explain the difference in 
the results. 

In the double precision case (top right), we can go to higher precision results. By using Sf method or zero 
mean modification (the difference between both options is less visible), we get 92 right oscillations until time 
t = 206 (the last oscillation is not good resolved hat the end), the electric field goes under 6 • 10~^^ < e~^^, 
and we guess that we could add 11 more oscillations until time t = 231 (we see that grid side effects polllute 
the result), to obtain 103 oscillations and with electric field below 6 • 10~^^ < e~^^, but we are limited here in 
the double precision case to N = 2048. A CPU simulation with N = 4096 confirms the results. We also see 
the effect of the grid (runs with N = 1024) and the velocity (runs with Vmax = 6). Note that the plot 6 (trap 
nodelta 1024 v6) has lost a lot of accuracy compared to the other plots: the grid size is too small {N = 1024), 
the velocity domain also {vmax = 6) and above all there is no zero mean or Sf method. In that case, we 
only reach time t = 100. We refer to [17, 31] for other numerical results and discussions and to the seminal 
famous work [23] for theoretical results. In [31], it was already mentionned that we have to take the velocity 
domain large enough and to take enough grid points. Concerning GPU and single precision, the benefit of a Sf 
method was also already treated in [22]: 29 right oscillations were obtained in the single precision case with a 
Sf modification, 13 right oscillations without the modification and the time t = 100 was reached in the CPU 
case {N was set to 1024 and 'Umax to 6). 

On Figure 1 middle left, we plot the error of mass, which is computed as |po — 1|. We clearly see the impact 
between the conservation of the mass and the former results. We can also note that, the zero mean modification 
does not really improve the mass conservation (just a slight improvement at the end, plots 2,3,4), but has a 
benefic effect on the electric field: the bad behaviour of the mass conservation is not propagated to the electric 
field. On the other hand, the Sf method clearly improves the mass conservation. We also see the effect of taking 
a too small velocity domain, in the double precision case. 

4.2. Bump on tail 

For this second standard test case, the initial condition is considered as a spatial and periodic perturbation 
of two Maxwellians (see [27]) 



The Vlasov-Poisson model preserves some physical quantities with time, called Casimir functions, which will be 
used to compare the different implementations. Particulary, we look at the time history of the total energy £ 
of the system, which is the sum of the kinetic energy and the electric energy 



lOV^ 



2 



exp(-2(v-4.5)^) (l + 0.03cos(0.3x)), {x,v) e [0, 207r] x [-9,9]. 




As in the previous case, the time evolution of the electric energy is chosen as a diagnostics. 



ESAIM: PROCEEDINGS 



9 



Results are shown on Figure 1 (middle right and bottom) and on Figure 2. 

We see on Figure 1 middle right the evolution in time of the electric field. Single and double precision results 
are compared. In the single precision case, the Sf method with FFT computation of the electric field (plot 
3: single delta) is the winner and the basic method without modifications and trapezoidal computation of the 
electric field (plot 7: trap single no delta) leads to the worst result. Double precision computations lead to 
better results and differences are small: plots 1 (double delta) and 2 (double no delta) are undistinguishable 
and plot 8 (trap double no delta) is only different at the end. Thus, such modifications are not so mandatory in 
the double precision case. We then see for the same runs, the evolution of the error of mass (bottom left) and 
of the first mode of p in absolute value (bottom right). We notice that the error of mass linearly accumulates 
in time. Here no error coming from the velocity domain is seen, because 'Umax is large enough ('Umax = 9 in all 
the runs). The evolution of the first mode of p is quite instructive: we see that it exponentially grows from 
round off errors and the different runs lead to quite different results. The loss of mass can become critical in the 
single precision case (no real impact in the double precision case are detected) and implementations without 
mass error accumulation would be desirable. The Sf method improves the results, but the error of mass still 
accumulates much more than in the double precision case. 

On Figure 2, we see the same diagnostics in the double precision case. We make vary the number of grid 
points, the degree of the interpolation and the time step. By taking smaller time step, we can increase the 
time before the merge of two vortices among three which leads to a breakdown of the electric field. Higher 
degree interpolation lead to better results (in the sense that the breakdown occurs later), for not too high grid 
resolution. When N = 2048, lower order interpolation seems to be better, since it introduces more diffusion, 
whereas high order schemes try to capture the small scales, which are then sharper and more difficult to deal 
with in the long run. Adaptive methods and methods with low round-off error in the single precision case may 
be helpful to get better results. 

4.3. KEEN Waves 

In this last and most intricate test, instead of considering a perturbation of the initial data, we add an 
external driving electric field £^app to the Vlasov-Poisson equations: 

St/ + ^S,/+(^-^app)S,/ = 0, S,^= / fdv-1, 

JR 

where ^app(^5^) is of the form Eg,pp{t^x) = E^g,^ka{t) sm{kx — ujt), where 

0.5(tanh(^) -tanh(^)) -e tn-tr tn-tn 

a{t) = ^-^^^ ^-^^^ , e = 0.5(tanh(^^^) - tanh(^^^)) 

is the amplitude, to = 0, II = 69, = 307, t^L = t^R = 20, k = 0.26, uj = 0.37 and £^max = 0.2. The initial 
condition is ^ 

fo{x,v) = ^exp (^-y^ , {x,v) e [0,27r/k] x [-6,6]. 

See [1, 28] for details about this physical test case. Its importance stems from the fact that KEEN waves 
represent new non stationary multimode oscillations of nonlinear kinetic plasmas with no fiuid limit and no 
linear limit. They are states of plasma self-organization that do not resemble the (single mode) way in which 
the waves are initiated. At low amplitude, they would not be able to form. KEEN waves can not exist off 
the dispersion curves of classical small amplitude waves unless a self-sustaining vortical structure is created in 
phase space, and enough particles trapped therein, to maintain the self-consistent field, long after the drive field 
has been turned off. For an alternate method of numerically simulating the Vlasov-Poisson system using the 
discontinuous Galerkin approximation, see [8] for a KEEN wave test case. 

As diagnostics, we consider here different snapshots of / — /o at different times: t = 200, t = 300, t = 400, t = 600 
and t = 1000. 
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We first consider ttie time t = 200 (upper left in Figure 3). At this time, all the snapshots are similar so we 
present only one (GPU single precision and a grid of 1024^ points). The five others graphics of this figure are 
taken at time t=300. We show that, at this time, there is again convergence because the graphic on the middle 
right (GPU single precision, N = 4096 and At = 0.1) is identical to the bottom left one (GPU single precision, 
N = 4096 and At = 0.01). 

The Figure 4 presents different snapshots at times t = 400 and t = 600. At time t = 400, the upper left 
graphic (GPU single precision, N = 2048, At = 0.1) is similar to the upper right one (CPU, N = 2048, 
At = 0.05), that shows that the CPU and the GPU codes give the same results. With 4096 points (on middle 
left), we observe a little difference with the 2048 points case. Between the snapshots at time t = 400 and those 
at time t = 600, we observe the emergence of diffusion. 

The time t = 1000 is considered on Figure 5. We see that there is no more convergence at this time: 
there is a lag, but the structure remains the same. We compare also different interpolators (cubic splines, 
LAG 3, LAG 9, LAG 17). If the order of the interpolation is high (graphic at the top right : CPU, LAG 17, 
At = 0.05, A^a; = 512, A^^ = 4096) there is appearance of finer structures. At this time, one sees little difference 
between CPU results (graphic at the middle right : CPU, LAG 3, At = 0.05, N = 4096) and GPU results 
(graphic at the bottom left : GPU, LAG 3, At = 0.05, N = 4096), but there is no lag. 

The Figure 6 (at time t = 1000) shows the differences between single and double precision when the value of 
N is changed. The two graphs above show the case N = 1024, the left is single precision while the right one is in 
double precision. We see that there are very few differences. When N = 2048, the results are different in single 
precision (graphic on middle left) and double precision (graphic on middle right). When N = 4096, the code 
does not work in double precision so we compared the results for single precision GPU with At = 0.05 (bottom 
left graphic) and At = 0.01 (bottom right graphic). There are also differences due to the non-convergence. 
Moreover, we see that there are more filament ations when N increases. 

The Figure 7 shows the time evolution of the absolute value of the first Fourier modes of p. We see that single 
precision can modify the results on the long time (top left). The GPU code is validated in double precision 
(top right). We clearly see the benefit of the 5f method in the GPU single precision (middle left), where it has 
no effect in the double precision case (middle right). Further plots are given with N = 4096 (bottom left and 
right). With smaller time steps, some small oscillations appear with single precision GPU code (bottom right). 
In all the plots, we see no difference at the beginning; differences appear in the long run as it was the case for 
the plots of the distribution function. 



4.4. Performance results 

Characteristics. We have tested the code on different computers with the following characteristics: 

• GPU 

- (1) = irma-gpul : NVIDIA GTX 470 1280 Mo 

- (2) = hpc : GPU NVIDIA TESLA C2070 

- (3) = MacBook : NVIDIA GeForce 9400M 

• CPU 

- (4) = MacBook : Intel Core 2 Duo 2.4 GHz 

- (5) = irma-hpc2: Six-Core AMD Opteron(tm) Processor 8439 SE 

- (6) = irma-gpul: Intel Pentium Dual Core 2.2 Ghz 2Gb RAM 

- (7) = MacBook : Intel Core 15 2.4 GHz 

We measure in the GPU codes the proportion of FFT which consists in: transform ID real data to complex, 
computing the FFT, making the complex multiplication, computing the inverse FFT, transforming to real data 
(together with addition of Sf modification, if we use the Sf method). We add a diagnostic for having the 
proportion of time in the cuf ftExec routine; we note that this extra diagnostic can modify a little the time 
measures (when this is the case; new measures are given in brackets, see on Table 1). 
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When the number of cehs grows, the proportion of FFT time in total time grows, as shown on Table 1 (KEEN 
wave test case with 6f modification) or Table 2 (KEEN wave test case without 5f modification). Note that the 
initialisation time and the 2d-diagnostic time are not included in total time. 

The results with a CPU code (vlaso) without OpenMP are given on Table 3, top; in that code, the Landau 
test case is run with At/2 advection in x, followed by At advection in v and At/ 2 advection in x and the last 
advection in x of iteration n is merged with the first advection in x of iteration n + 1. 

The results with Selalib (Table 3, bottom) [26] are obtained with OpenMP. We use 2 threads for (4), 24 
threads for (5), 2 threads for (6) and 4 threads for (7). 

In order to compare the performances, we introduce the number MA which represents the number of millions 
of point advections made per second : MA = ^Q6xT^ai\h^ number of operations per second (in 

GigaFLOPS) given by : 

GF = ^-^^e.xiV...x (2iVx5iVlog(iV) + 6iV^) ^^i,, ,,,, (^pU) 

109 X Total time ^ ^ ^ 

GF = ^-^^e.xiV„,.x(iVx5iVlog(iV) + 3iV^) ^j,, ,,,, (^PU) 

10^ X Total time ^ ^ 

where Ngtep refers to the number of time steps and Nadv represents the number of advections made in each time 
step {Nadv = 3 in GPU and Selalib codes; Nadv = 2 in vlaso code). In each advection, we compute N times 
(GPU in complex data) or N/2 times (CPU in real data) : 

• A forward FFT and backward FFT with approximately 5A^ log(A^) operations for each FFT computation 

• A complex multiplication that requires 6N operations. 

The comparison between the Table 1 and the Table 2 shows that the cost of the method Sf is not too 
important but not negligible. This cost could be optimized. We clearly benefit of the huge acceleration of the 
FFT routines in GPU and we thus gain a lot to use this approach. Most of the work is on the FFT, which 
is optimized for CUD A in the cufft library, and is transparent for the user. Note that we are limited here to 
N = 4096 in single precision and N = 2048 in double precision; also we use complex Fourier transform; optimized 
real transforms may permit to go even faster. The merge of two velocity time steps can also easily improve the 
speed. Higher order time splitting may be also used. Also, a better comparison with CPU parallelized codes 
can be envisaged (here, we used a basic OpenMP implementation which only scales for 2 processors). We can 
also hope to go to higher grids, since cufft should allow grid sizes of 128 millions elements in double precision 
and 64 millions in single precision (here we use 2^^ :^ 16.78 • 10^ elements in single precision and 2^^ 4.2 • 10^ 
elements in double precision; so we should be able to run with N = 8192 in single precision and N = 4096 
in double precision). Higher complexity problems (as AD simulations) will probably need multi-gpu which is 
another story, see [13] for such a work. 

5. Conclusion 

We have shown that this approach works. Most of the load is carried by the FFT routine, which is optimized 
for CUD A in the cufft library, leading to huge speed-ups and is invisible to the user. Thus, the overhead of 
implementation time which can be quite significant in other contexts is here reduced, since we use largely built- 
in routines which are already optimized. The use of single precision is made harmless thanks to a Sf method. 
We however are not able to get as precise results as in the case of double precision. The test cases we chose are 
quite sensitive to single precision round off errors. We point out also that the electric field has to satisfy a zero 
mean condition with enough accuracy on a discrete grid. For the moment, we are limited to same sizes in x and 
V (needed here for the transposition step) and to = 2048 in double precision {N = 4096 in single precision). 
We hope to implement a four dimensional (2x, 2v) version of this code, next, including weak collisions. 
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Table 1. Performance results for GPU, nbstep=1000, LAG17, KEEN wave test case with 6f 



modification: total time, speedup, MA, proportion FFT/total time (and cufftExec/total time). 
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Table 2. Performance results for GPU, nbstep=1000, LAG17, KEEN wave test case without 



Sf modification: total time, speedup, MA, GFlops and proportion FFT/total time. 
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Table 3. Performance results for CPU vlaso code, nbstep=1000, LAG 17, Landau test case 
(top): total time, MA and GFlops. Performance results for CPU Selalib code, nbstep=1000, 
LAG 17, KEEN test case without 6f modification (bottom): total time, MA and GFlops. 
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Figure 1. Linear Landau damping. N = 2048, At = 0.1, Vmax = 8, LAG17, irma-gpul on 
GPU as default. Evolution in time of electric energy in single/double precision (top left/right). 
Error of mass |po — 1| with single precision as default (middle left). Bump on tail test case. 
N = 1024, At = 0.05, LAG9, irma-gpul on GPU as default. Evolution in time of the electric 
energy/ error of mass/ first Fourier mode of p, |pi| (middle right /bottom left /bottom right). 
[ for details, see the legends, efftielectric field compute with FFT; delta= df method; no 
delta= without the Sf method; single=single precision; double:double precision; trap:electric 
field computed with trapezoidal method; zero meanizero mean modification for the electric field; 
cpu: cpu code used; 1024: N = 1024; 4096: N = 4096; v6: v^ax = 6; Macbook: MacBook 
GPU is used]. 




Figure 2. Bump on tail test case. Double precision is used, irma-gpul on GPU. Evolution in 
time of the electric energy for N = 256, 512, 1024, 2048 (top left, top right, middle left, middle 
right), with LAGS, LAG9, LAG17 recosntructions and various time steps (0.1,0.01,0.002). 
Evolution in time of the first Fourier mode , |pi| for N = 256 and N = 1024 (bottom left), and 
for N = 512 and N = 2048 (bottom left), with the same reconstructions and time steps. 



ESAIM: PROCEEDINGS 



Pseudocolor 
^I-Oj07760 

-■cm 600 

^L^^^^axo ^^^^r 

^^^ch: a. W ^^^^^ ^ 


1 

H DiaaSQQ ^^^H 




PaaudQcobr 1 
Van r 

^^O.-ISOQ 


^J<)1M0 ^^^^^ 

M^-ollF ^^^^^^^ 





Figure 3. KEEN wave test case (LAG17): f{t,x,v) - fo{x,v). At time t = 200, GPU single 
precision A/" = 1024 (top left). At time t = 300, GPU single precision N = 1024, 2048, 4096 and 
At = 0.1 (top right, middle left, middle right). N = 4096 and At = 0.01 (bottom left). CPU 
N = 2048, At = 0.1 (bottom right). {x,v) G [0, 27r/A:] x [0.18,4.14]. If not changed, from one 
picture to another (from top left to bottom right), parameters are not restated. 
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Figure 4. KEEN wave test case (LAG17): f{t,x,v) - fo{x,v). At time t = 400, GPU single 
precision N = 2048, At = 0.1 (top left). CPU At = 0.05 (top right). GPU single precision 
N = 4096, At = 0.1, at time t = 600 (middle left). GPU single precision N = 1024 (middle 
right). At = 0.01,7V = 4096 (bottom left). CPU = 512, Ny = 4096 (bottom right). 
(x^v) e [0,27r//c] X [0.18,4.14]. If not changed, from one picture to another (from top left to 
bottom right), parameters are not restated. 
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Figure 5. KEEN wave test case: f{t,x,v) — fo{x,v) at time t = 1000. CPU cubic splines, 
At = 0.05, = 512, AT^ = 4096 (top left). LAG17 (top right). N = 4096 and cubic splines 
(middle left). LAGS (middle right). GPU single precision (top left). LAG9 (bottom right). 
(x^v) G [0,27r//c] X [0.18,4.14]. If not changed, from one picture to another (from top left to 
bottom right), parameters are not restated. 
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Figure 6. KEEN wave test case (LAG17): f{t, x, v)-fo{x, v) at time t = 1000, At = 0.05, N = 
1024 as default. GPU single/double precision (top left/right). TV = 2048, GPU single/double 
precision (middle left/right). N = 4096, GPU single precision (bottom left). At = 0.01 (bottom 
right), (x^v) G [0,27r//c] x [0.18,4.14]. If not changed, from one picture to another (from top 
left to bottom right), parameters are not restated. 
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Figure 7. KEEN wave test case (LAG 17): Absolute values of the first Fourier modes of p 
(from mode /c = 1 to mode /c = 7) vs time. 5f method, with N 2048 At 0.05 GPU, 
double and single precision (lb, 2b, 3b) (top left), double GPU and double CPU (top right). 
Full version GPU in single precision and 5f version CPU (middle left). Full version and 5f 
version, in double precision (middle right). N = 4096, GPU and CPU (bottom left). GPU 
with At = 0.01 and CPU with At = 0.05 (bottom right). If not changed, from one picture to 
another (from top left to bottom right), parameters are not restated. 



